R is an open-source programming language that is popular for
statistics and data analysis. R was built by statisticians, so many
common statistical operations are built into the base language. R also
features powerful and intuitive libraries for plotting and a variety of
packages for predictive modeling, making it one of the most popular
languages for data science.
Data Types
Decimal numbers (real numbers) in R are known as doubles. Doubles are
the default numeric data type so when you manually enter a number in R,
you are working with a double.
typeof(1)
[1] "double"
typeof(-10.5)
[1] "double"
typeof(Inf)
[1] "double"
typeof(-Inf)
[1] "double"
typeof(NaN) # dividing zero by zero produces NaN
[1] "double"
Integers are a second numeric data type that can hold only whole-number
values.
as.integer(1) # Convert the double 1 to integer 1
[1] 1
typeof(as.integer(1))
[1] "integer"
# Convert back from integer to double
as.numeric(as.integer(1))
[1] 1
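As a small aside to the conversion functions above, R also lets you write an integer directly by appending L to a number. The sketch below uses illustrative variable names and shows how integers behave when mixed with doubles.

```r
# The L suffix creates an integer without calling as.integer()
count <- 5L
typeof(count)        # "integer"

# Mixing an integer with a double yields a double
result <- count + 0.5
typeof(result)       # "double"

# as.integer() truncates toward zero rather than rounding
truncated <- as.integer(3.9)
truncated            # 3
```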
Our first non-numeric data type is the logical. A logical takes on
the value TRUE or FALSE. You must type TRUE and FALSE in all capital
letters for R to recognize them as logical values. R also accepts T and F
as shorthand, but these are ordinary variables that can be reassigned, so
the full names are safer. Data that takes on only the values true or
false is also called “Boolean”.
typeof(TRUE)
[1] "logical"
typeof(FALSE)
[1] "logical"
typeof(T)
[1] "logical"
typeof(F)
[1] "logical"
You can create logical values with logical comparisons. R supports the
standard comparison operators > (greater than), < (less than), >=
(greater than or equal), <= (less than or equal), == (equal) and !=
(not equal). You can negate a logical value with ! and combine logical
values with & (and) and | (or).
20 > 20
[1] FALSE
20 >= 20
[1] TRUE
10 == 10
[1] TRUE
10 != 20
[1] TRUE
!FALSE
[1] TRUE
(2 > 1) & (10 == 9)
[1] FALSE
(2 > 1) | (10 == 9)
[1] TRUE
Strings of text in R are known as characters. Surround text with
quotation marks to create a character.
typeof("cat")
[1] "character"
typeof("1")
[1] "character"
typeof(as.numeric("12"))
[1] "double"
typeof(as.character(12))
[1] "character"
Simple Arithmetic Operations
12 + 6 # addition
[1] 18
12 - 6 # subtraction
[1] 6
12 * 6 # multiplication
[1] 72
12 / 6 # division
[1] 2
12^6 # exponentiation
[1] 2985984
12**6 # exponentiation
[1] 2985984
12 %% 6 # modulo (get remainder)
[1] 0
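Alongside the modulo operator, R provides %/% for integer division. The short sketch below, with illustrative values, shows how the quotient and remainder fit together.

```r
dividend <- 17
divisor <- 5

quotient <- dividend %/% divisor    # integer division: 3
remainder <- dividend %% divisor    # modulo: 2

# The quotient and remainder reconstruct the original number
rebuilt <- quotient * divisor + remainder
rebuilt                             # 17
```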
Variables
A variable is a name you assign to a value or object. After assigning a
variable, you can access its associated value or object using the
variable’s name. Simply put, this is how we store data. In R, assign
variables using <- (the less-than sign followed by a hyphen).
var <- 3 + 3
x <- 10
y <- "R Workshop"
z <- (sqrt(144) == 12)
print(var)
[1] 6
print(x)
[1] 10
print(y)
[1] "R Workshop"
print(z)
[1] TRUE
It is possible to assign variables in R using the equals symbol =
instead of <-. The equal sign is used for variable assignment in many
other programming languages (such as Python). One reason for using <-
besides conforming to the style preferred by the R community is that the
equals sign is used in places other than variable assignment statements.
Functions often take named arguments and when calling a function you use
the = symbol to assign values to named arguments.
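The difference between the two symbols can be seen inside a function call. The sketch below uses a hypothetical helper named square (not part of R) to show that = only names an argument, while <- in the same position also creates a variable.

```r
# A hypothetical helper with one named argument
square <- function(value) {
  value^2
}

# = inside a call only names the argument; no variable is created
square(value = 4)
exists_before <- exists("value", inherits = FALSE)
exists_before    # FALSE

# <- inside a call also assigns a variable where the call is made
square(value <- 4)
exists_after <- exists("value", inherits = FALSE)
exists_after     # TRUE
```

This is one reason to keep <- for assignment and = for arguments: the two roles stay visually distinct.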
Vectors or Collections
A vector is a sequence of data elements of the same data type. You
can have numeric, logical and character vectors.
To create and store a vector with specific values, use the c()
function and assign the result to a variable. c() takes a comma
separated sequence of elements as input and combines them into a vector.
You can also combine two vectors using the c() function. If you try to
combine vectors of different types, R will automatically coerce the
elements to a single common type, the most general one present (logical,
then integer, then double, then character).
# Creating a character vector for the days of the week
weekday <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
weekend <- c("Saturday", "Sunday")
days <- c(weekday, weekend)
print(days)
[1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday" "Saturday" "Sunday"
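To illustrate what happens when types are mixed inside c(), here is a minimal sketch with made-up values. The result always has a single type.

```r
# Number and logical: TRUE is coerced to 1, so the result is double
num_lgl <- c(1.5, TRUE)
num_lgl           # 1.5 1.0
typeof(num_lgl)   # "double"

# Anything combined with text becomes character
mixed <- c(1, "a", TRUE)
mixed             # "1" "a" "TRUE"
typeof(mixed)     # "character"
```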
Vector Indexing Starts at 1
The first element is at index position 1, the second element is at
index position 2 and so on.
Note: unlike many other programming languages, indexes in R start at
1 instead of 0.
When you print a vector to the screen, each line starts with a number
in square brackets followed by vector values. The number in square
brackets indicates the index of the next value listed on that line.
You can access a specific value in a vector by typing the name of the
vector and then wrapping the index associated with the value you want to
access in square brackets.
days[1]
[1] "Monday"
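A few edge cases of 1-based indexing are worth knowing. The sketch below uses a small illustrative vector rather than the days vector.

```r
letters_abc <- c("a", "b", "c")

# The last element sits at index length(x), not length(x) - 1
last_item <- letters_abc[length(letters_abc)]
last_item       # "c"

# An index past the end returns NA instead of raising an error
past_end <- letters_abc[4]
past_end        # NA

# Index 0 selects nothing and returns an empty vector
zero_index <- letters_abc[0]
length(zero_index)   # 0
```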
Range of Values: Inclusive
You can access ranges of values by placing a colon between the
starting and ending indices of the range:
days[1:3]
[1] "Monday" "Tuesday" "Wednesday"
Pulling out specific elements from Vectors
days[c(1, 3, 5, 7)]
[1] "Monday" "Wednesday" "Friday" "Sunday"
Subset out of a Collection
A subset of a vector is just a shorter vector. You can access a
specific subset of values by placing a vector of indices inside the
square brackets.
weekdays <- days[1:5]
weekdays
[1] "Monday" "Tuesday" "Wednesday" "Thursday" "Friday"
Generate a vector using 100 random Numbers
random_data <- runif(100) # Create a vector of 100 random numbers
print(random_data)
[1] 0.601253516 0.719215520 0.428791696 0.954586146 0.002825221 0.912358903 0.094981659 0.465619302 0.210666870 0.902174364 0.396719641 0.322112673 0.511106455
[14] 0.521341122 0.363127518 0.366527548 0.527183666 0.685147211 0.519944513 0.591298793 0.061995111 0.302844349 0.880424201 0.365061312 0.463684499 0.804066397
[27] 0.148957152 0.287857758 0.093952734 0.758151887 0.103074430 0.892897486 0.579773317 0.798398663 0.603510989 0.241194112 0.231561340 0.785641948 0.750667124
[40] 0.190808424 0.629992959 0.855862409 0.412287503 0.960019909 0.404498031 0.570024413 0.959659815 0.174513218 0.053899949 0.672869894 0.142749767 0.654829003
[53] 0.636226679 0.926860283 0.955050981 0.599153793 0.263378433 0.476719820 0.997212063 0.814981470 0.173799206 0.036244909 0.628301271 0.296626335 0.121297164
[66] 0.512687834 0.130345278 0.447331684 0.881440107 0.949472001 0.755979348 0.818791750 0.915813698 0.853843003 0.740594027 0.806554129 0.891963726 0.786478848
[79] 0.098086106 0.034598737 0.606487290 0.959551943 0.111269935 0.825509603 0.398758434 0.633334293 0.279467671 0.517837560 0.080050943 0.527193833 0.357329777
[92] 0.862411802 0.275236687 0.071194843 0.681278336 0.465154053 0.468364371 0.391514763 0.098308521 0.022964923
print(length(random_data))
[1] 100
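Because runif() draws new random numbers on every call, your output will differ from the values printed above. If you need repeatable results, one option is to fix the random seed with set.seed() first. The seed value 123 below is arbitrary.

```r
set.seed(123)            # any fixed integer works as a seed
first_draw <- runif(3)

set.seed(123)            # resetting the seed replays the same sequence
second_draw <- runif(3)

identical(first_draw, second_draw)   # TRUE
```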
Filtering of Vectors using Logical Expressions
Negative indices exclude elements: the result contains every element
except those at the given positions. You can also index a vector with a
logical vector of the same length. In this case, the subset is created
from each index where the corresponding logical value is TRUE. Indexing
with a logical vector is a common way to filter a numeric or character
vector for values that fulfill certain criteria.
# Exclude the element at the specified index
y <- c(1, 0, 3)
y <- y[-2]
y
[1] 1 3
# Exclude the range 2 to 49
random_data <- runif(50)
random_data_sub <- random_data[-(2:49)]
random_data_sub
[1] 0.2665789 0.3968453
# Keep only the values that satisfy a logical expression
over_half <- (random_data > 0.5)
new_subset <- random_data[over_half]
new_subset
[1] 0.9513518 0.9196884 0.6220517 0.6799197 0.8695307 0.7120434 0.8860579 0.5192011 0.6553047 0.8639851 0.6851627 0.7975867 0.8037875 0.5572514 0.9958366 0.5052394
[17] 0.5590207 0.9668272 0.6954043 0.7933677 0.6393207 0.9396667 0.8633686 0.7923260 0.7717887 0.9555974 0.9679353 0.8015830 0.5403477 0.7434739
Use %in% to filter a vector
my_letters <- c("a", "b", "c", "d", "a", "c")
# Get only the a's and c's
my_letters[my_letters %in% c("a", "c")]
[1] "a" "c" "a" "c"
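Logical filters can also be combined, counted, or turned into positions. The following sketch uses an illustrative scores vector that is not part of the examples above.

```r
scores <- c(55, 72, 90, 38, 64)   # illustrative values

# Combine conditions with & to keep values inside a range
mid_range <- scores[scores >= 50 & scores < 80]
mid_range                  # 55 72 64

# which() returns the positions where the condition is TRUE
passing_positions <- which(scores >= 50)
passing_positions          # 1 2 3 5

# sum() counts TRUE values, since TRUE is treated as 1
n_passing <- sum(scores >= 50)
n_passing                  # 4
```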
Vectorized Operations
Many R functions and operations behave in a “vectorized” manner,
meaning they act upon each element of a vector individually and return
the result of each of the operations in a new vector. Vectorized
operations simplify the process of performing the same calculations on
related data. All the basic operators and functions we’ve learned so far
that operate on single values also work on vectors with more than one
element.
example_vector <- c(1, 2, 3)
# adds to each value in the vector
example_vector + 10
[1] 11 12 13
# performs subtraction on each value
example_vector - 10
[1] -9 -8 -7
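Vectorized operations also work between two vectors of the same length, pairing up elements by position. A small sketch with made-up prices and quantities:

```r
prices <- c(2.5, 4.0, 1.5)
quantities <- c(4, 1, 2)

# Element-wise multiplication: first with first, second with second, ...
line_totals <- prices * quantities
line_totals              # 10 4 3

# Functions such as sqrt() are applied to every element
roots <- sqrt(c(1, 4, 9))
roots                    # 1 2 3

# Comparisons are vectorized too and return a logical vector
affordable <- prices < 3
affordable               # TRUE FALSE TRUE
```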
Different Ways to Generate Vectors
x <- 1:20
y <- seq(from = 1, to = 20, by = 2)
r <- rep(1, times = 10)
s <- runif(n = 5, min = 0, max = 100)
x
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
y
[1] 1 3 5 7 9 11 13 15 17 19
r
[1] 1 1 1 1 1 1 1 1 1 1
s
[1] 74.66618 93.12643 14.42880 74.82574 60.27581
Control Flow
If, Else and Else If
An if statement checks whether some logical expression is true or
false and executes a specified block of code if the logical expression
is true.
In R, an if statement starts with if, followed by a logical
expression in parentheses, followed by curly braces containing the code
to execute when the expression is true.
If statements are often accompanied by else statements. Else
statements come after if statements and allow you to execute code in the
event that the logical expression of an if statement is false.
x <- 5
if (x > 0) {
  print("Positive number")
} else {
  print("Zero or negative number") # the else branch also covers x == 0
}
[1] "Positive number"
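When there are more than two cases, an else if branch can be placed between if and else. The sketch below separates zero from negative numbers; the variable label is illustrative.

```r
x <- 0
if (x > 0) {
  label <- "Positive number"
} else if (x < 0) {
  label <- "Negative number"
} else {
  label <- "Zero"    # reached only when both tests above are FALSE
}
print(label)         # [1] "Zero"
```

R checks the conditions from top to bottom and runs only the first branch whose condition is true.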
For Loops
A for loop is a programming construct that lets you go through each
item in a sequence and perform some operation on each one.
my_sequence <- seq(0, 100, 10)
# Create a new loop over the specified items
for (item in my_sequence) {
  print(item)
}
[1] 0
[1] 10
[1] 20
[1] 30
[1] 40
[1] 50
[1] 60
[1] 70
[1] 80
[1] 90
[1] 100
The next keyword causes a for loop to skip to the next
iteration of the loop.
my_sequence <- seq(0, 100, 10)
for (item in my_sequence) {
  if (item < 50) { # this if statement skips items less than 50
    next
  }
  print(item)
}
[1] 50
[1] 60
[1] 70
[1] 80
[1] 90
[1] 100
The break keyword halts the execution of the loop entirely.
my_sequence <- seq(0, 100, 10)
for (item in my_sequence) {
  if (item > 50) {
    break
  }
  print(item)
}
[1] 0
[1] 10
[1] 20
[1] 30
[1] 40
[1] 50
While Loops
While loops are similar to for loops in that they allow you to
execute code over and over again. A for loop runs at most once per
element of the sequence you are looping over. A while loop, on the other
hand, keeps executing its contents as long as a logical expression you
supply remains true.
x <- 5
iterations <- 0
# Execute as long as iterations < x
while (iterations < x) {
  print("Study")
  iterations <- iterations + 1 # Increment iterations by 1 each time the loop executes
}
[1] "Study"
[1] "Study"
[1] "Study"
[1] "Study"
[1] "Study"
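If Else on Vectors
An if statement tests a single logical value, so applying a condition to
every element of a vector takes a different approach. For example,
imagine you have a vector of numbers and you want to set all the negative
values in the vector to zero. One way to do it is to use a for loop with
an inner if statement.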
my_vect <- runif(25, -1, 1) # Generate some random data between -1 and 1
for (index in seq_along(my_vect)) { # loop through the positions 1 to 25
  number <- my_vect[index] # look up the next number using indexing
  if (number < 0) { # check if the number is less than 0
    my_vect[index] <- 0 # if so, set it to 0
  }
}
print(my_vect)
[1] 0.0000000 0.0000000 0.9865759 0.3029479 0.5311695 0.0000000 0.8078285 0.0000000 0.6218120 0.0000000 0.0000000 0.0000000 0.3058018 0.0000000 0.0000000 0.5942243
[17] 0.0000000 0.0000000 0.0000000 0.0000000 0.6908745 0.4105624 0.6997950 0.0000000 0.8020392
Using a for loop requires writing quite a bit of code, and explicit loops
in R are usually slower than vectorized operations.
Instead we could have used R’s ifelse() function to do the same thing in
a vectorized manner. ifelse() takes a logical test as the first
argument, a value to return if the test is true as the second argument
and a value to return if the test is false as the third argument:
data <- c(11, 7, NA, 9, NA, 13, 15, NaN, 19, 17, 14, NaN)
# Use ifelse() to fill the missing values (NA and NaN) with the mean
ifelse(is.na(data) | is.nan(data), # logical check (is.na() alone is also TRUE for NaN)
  mean(data, na.rm = TRUE), # value to set if TRUE
  data # value to set if FALSE
)
[1] 11.000 7.000 13.125 9.000 13.125 13.000 15.000 13.125 19.000 17.000 14.000 13.125
# Chaining ifelse to perform multiple operations
data <- c(11, 7, NA, 9, NA, 13, 15, NaN, 19, 17, 14, NaN)
d <- ifelse(is.na(data) | is.nan(data), "missing",
  ifelse(data < 10, "low",
    ifelse(data < 15, "medium", "high")
  )
)
table(d)
d
high low medium missing
3 2 3 4
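Returning to the earlier task of setting negative values to zero, the for loop can be replaced by a single vectorized call. The sketch below shows two possible approaches, ifelse() and the base function pmax(). The seed is an arbitrary choice made only so the run is repeatable.

```r
set.seed(1)                        # fixed seed so the run is repeatable
my_vect <- runif(25, -1, 1)

# ifelse() tests every element at once: 0 where negative, else the value
clamped <- ifelse(my_vect < 0, 0, my_vect)

# pmax() takes the element-wise maximum of my_vect and 0
clamped_pmax <- pmax(my_vect, 0)

all(clamped >= 0)                  # TRUE
all(clamped == clamped_pmax)       # TRUE
```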
Functions
A function is just an R object that runs a predefined snippet of
code, usually on some input that you supply to it. A function can return
an output based on the input you provide. For example, the sum()
function built into R takes a numeric vector as input and returns
the sum of its elements as output. Built-in functions and packages can
take you a long way in R, but it can be useful to define your own
functions to perform specific tasks outside the scope of built-in
functions.
Create your own function in R using this syntax:
# Assign the function() to a name and declare arguments within ()
new_function <- function(arguments) {
  # Write a function body within the {} to execute
  for (x in 1:arguments) {
    print("This is a function!")
  }
}
Here is an actual example:
exampleFunction <- function(x, y) {
  c(x + 1, y + 10)
}
exampleFunction(2, 4)
[1] 3 14
Functions with Return Value
Functions in R return the last expression evaluated by default.
add_10 <- function(number) {
  number + 10
  # return (number+10)
}
add_10(5)
[1] 15
add_20 <- function(number) {
  return(number + 20) # Exit and return specified value
  number + 10 # The function exits before running this line
}
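An explicit return() is most useful for leaving a function early, for example when the input cannot be processed. The helper safe_sqrt below is hypothetical and only meant as a sketch of that pattern.

```r
# Hypothetical helper: return early when the input is not usable
safe_sqrt <- function(number) {
  if (number < 0) {
    return(NA_real_)   # exit immediately for negative input
  }
  sqrt(number)         # otherwise the last expression is returned
}

safe_sqrt(16)   # 4
safe_sqrt(-1)   # NA
```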
Function Arguments
A function can have one or more named arguments. You can assign a
default value to an argument when creating a function with the
argument_name = argument_value syntax.
sum_3_items <- function(x, y, z, print_args = TRUE) {
  if (print_args) {
    print(x)
    print(y)
    print(z)
  }
  return(x + y + z)
}
# Named total1 rather than sum, to avoid confusion with the built-in sum()
total1 <- sum_3_items(1, 2, 3)
[1] 1
[1] 2
[1] 3
total2 <- sum_3_items(10, 20, 30, print_args = FALSE)
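Arguments can be matched by position or by name, and named arguments may appear in any order. The function describe_trip below is a hypothetical example with two default values.

```r
# Hypothetical function with one required and two defaulted arguments
describe_trip <- function(distance, unit = "km", round_to = 1) {
  paste(round(distance, round_to), unit)
}

# Only the required argument: both defaults are used
describe_trip(12.3456)                           # "12.3 km"

# Override one default by name
describe_trip(12.3456, unit = "miles")           # "12.3 miles"

# Named arguments can be given in any order
describe_trip(round_to = 2, distance = 12.3456)  # "12.35 km"
```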
Ellipsis
The ... argument (the ellipsis, typed as three dots) collects all extra
arguments passed to a function that are not matched to a named argument.
It can be used in functions where the number of arguments is not known in
advance.
addition_function <- function(...) {
  total <- 0
  # list(...) collects the arguments into a list
  for (value in list(...)) {
    # Add each argument in ... to the total
    total <- total + value
  }
  total
}
addition_function(2, 4, 6, 8, 10, 12, 14)
[1] 56
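Another common use of ... is to pass extra arguments through to a function called inside your own. The wrapper rounded_mean below is a hypothetical sketch; it forwards anything in ... to the built-in mean().

```r
# Hypothetical wrapper: extra arguments in ... are handed on to mean()
rounded_mean <- function(values, ...) {
  round(mean(values, ...), 1)
}

rounded_mean(c(1, 2, 4))                  # 2.3
rounded_mean(c(1, NA, 4))                 # NA
rounded_mean(c(1, NA, 4), na.rm = TRUE)   # 2.5
```

The wrapper never has to list na.rm itself; any option that mean() understands is passed along unchanged.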
